A low-latency graph computer to identify metastable particles at the Large Hadron Collider for real-time analysis of potential dark matter signatures

Image recognition is a pervasive task in many information-processing environments. We present a solution to a difficult pattern recognition problem that lies at the heart of experimental particle physics. Future experiments with very high-intensity beams will produce a spray of thousands of particles in each beam-target or beam-beam collision. Recognizing the trajectories of these particles as they traverse layers of electronic sensors is a massive image recognition task that has never been accomplished in real time. We present a real-time processing solution that is implemented in a commercial field-programmable gate array using high-level synthesis. It is an unsupervised learning algorithm that uses techniques of graph computing. A prime application is the low-latency analysis of dark-matter signatures involving metastable charged particles that manifest as disappearing tracks.

graph. Graph-computing logic is used to compute discretized second derivatives ("laplacians") at each node of the middle three layers. A particle's helical trajectory is represented by a path through the graph that contains one node in each layer. As shown in ref. 3, reconstructing the trajectory corresponds to minimizing the (absolute value of the) laplacian at all nodes on this path.
A key insight of the algorithm is that the global minimum of all $(L-2)N^3$ laplacians is not found by searching for the local minima at each node. On the contrary, the algorithm succeeds by iteratively vetoing poor trajectories, i.e. rejecting the combinations of three nodes (triplets) that correspond to a large laplacian value. We refer to this method of convergence as pruning. The fraction of triplets that are rejected at each iteration can be tuned; following ref. 3, this implementation uses a fraction of 50%. Faster convergence would be achieved if the rejection fraction were, say, 75%; whether the robustness demonstrated in ref. 3 would be maintained for a larger rejection fraction is a topic for future study.
The number of links at a node with its two adjacent layers is initially $N^2$, so the total number of links is initially $(L-1)N^2$. These links are iteratively pruned, until only links that comprise smooth paths through all layers remain. Each iteration comprises two logical operations, pruning and consensus.
At the end of the iterative procedure, multiple trajectories are found in a wedge, most of which are "ghosts" and result from combinatorial chance. Ghost tracks zig-zag and do not satisfy smoothness criteria. A quality control procedure selects the smoothest trajectory. We show that the smoothest trajectory is always that of the high-momentum particle of interest, should one exist in the wedge. In the rare circumstance that there are two or more particles of interest in the wedge, simple extensions of the algorithm allow for further selection on the basis of momentum, smoothness and the desired number of tracks; these quantities are computed in the quality-control procedure. Thus it is straightforward to configure the quality control procedure to output all trajectories that satisfy trigger criteria.

Pruning
Pruning is described in ref. 3 as follows: "The sort engine sorts the $N \times N$ list of $\Box_{ijk,l}$ values in increasing magnitude. Each $\Box_{ijk,l}$ value is stored as part of a tuple containing the associated $j$ and $k$ values which identify the corresponding triplet of hits. The sorted list of tuples is used by the scan engine to create a ranked list of $j$ and $k$ values, where the rank is defined as the ordinal number of first appearance in the sorted $\Box_{ijk,l}$ list. Thus, a $j$ or $k$ value with a large rank is one that never makes a smooth trajectory, while a low rank corresponds to a smoother trajectory. In each sort cycle, the $j$ and $k$ values with large rank are dropped, which purges those links that are unlikely to form smooth trajectories." Here $\Box_{ijk,l}$ represents the 1D or 2D laplacian value as defined in ref. 3 for the triplet of hits $(i, j, k)$, where hit $i$ in layer $l \in \{1, \ldots, L-2\}$ is linked to hit $j$ in the next radial layer $(l+1)$ and hit $k$ in the previous radial layer $(l-1)$. We implement the tuple as a 24-bit integer in which the upper 16 bits store the laplacian and the lower byte stores $j$ and $k$ in 4 bits each.
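The 24-bit tuple layout can be illustrated with a short Python sketch. The function names are hypothetical, and placing $j$ in the upper nibble and $k$ in the lower nibble of the low byte is an assumption; the text specifies only that each index occupies 4 bits.

```python
def pack_tuple(laplacian, j, k):
    """Pack a 16-bit laplacian and two 4-bit node indices into a 24-bit integer."""
    if not (0 <= laplacian < (1 << 16) and 0 <= j < 16 and 0 <= k < 16):
        raise ValueError("field out of range")
    # Upper 16 bits: laplacian. Low byte: j (assumed upper nibble), k (lower nibble).
    return (laplacian << 8) | (j << 4) | k


def unpack_tuple(word):
    """Recover (laplacian, j, k) from a packed 24-bit tuple."""
    return word >> 8, (word >> 4) & 0xF, word & 0xF
```

With the laplacian in the most significant bits, a plain integer comparison of two packed tuples orders them by laplacian first, so a comparator acting on the whole word would carry the indices along without extra logic.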
Pruning is an iterative and distributed algorithm. The iteration count $t$ runs from $t = n$ to $t = 1$, with $t \to t-1$ for each successive iteration. For a given iteration, at each node $i$ in layer $l$, there are approximately $2^t \times 2^t$ possible local paths connecting the nodes in layers $l-1$, $l$ and $l+1$. Each local path at node $(i, l)$ consists of two links: one link to an outer node $(j, l+1)$ and one link to an inner node $(k, l-1)$. We denote these links as $(i, l; j, l+1)$ and $(i, l; k, l-1)$ respectively. This local path has the discrete laplacian value $\Box_{ijk,l}$. As described in ref. 3, pruning reduces the number of viable paths at each node to $2^{t-1} \times 2^{t-1}$, such that there are $2^{t-1}$ surviving $(i, l; j, l+1)$ links and $2^{t-1}$ surviving $(i, l; k, l-1)$ links.
A highlight of this paper is the implementation of the sort and scan engines that is fast, modular, parallelizable and amenable to pipelining. We prove (see section "Sort and scan engines") that the combination of the sort and scan engines is mathematically equivalent to: (a) for each $(i, l)$, construct a matrix of tuples indexed by the $j$ and $k$ values, (b) construct a MinimumFinder that finds the tuple with the smallest laplacian value in each row of this matrix, (c) the array of these minima, indexed by row, is processed by a Minimum Set Selector (MSS) circuit, which splits the array into two halves; of the $2^t$ minima, the lower (upper) half contains the smallest (largest) $2^{t-1}$ laplacians. Only the lower half is propagated to the next iteration, achieving the intended 50% rejection factor during pruning.
The symmetry between $j$ and $k$ (i.e. linked nodes in the adjacent sensor layers at larger and smaller radii respectively) is maintained by running in parallel a second MinimumFinder circuit on a transposed matrix of tuples.
The MSS design can be easily modified to reject, say, 75% at each iteration, by saving only the smallest $2^{t-2}$ laplacians. The design is efficient because no time or FPGA resources are wasted in further sorting, which is irrelevant at a given iteration due to the iterative nature of the algorithm.
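Steps (a) to (c) can be sketched in software for a single node. This is a minimal Python illustration and not the HLS code: `prune_node` and its arguments are hypothetical names, the matrix holds bare laplacians rather than packed tuples, and Python's `sorted` stands in for the MSS, which in hardware only separates the lower half without fully ordering it.

```python
def prune_node(tm, keep_fraction=0.5):
    """Return the surviving j and k indices for one node (i, l).

    tm[j][k] is the laplacian of the local path through outer node j
    and inner node k (hypothetical layout).
    """
    n_j, n_k = len(tm), len(tm[0])
    # (b) MinimumFinder: smallest laplacian in each row, one value per j.
    row_min = [min(row) for row in tm]
    # Second MinimumFinder on the transposed matrix, one value per k.
    col_min = [min(tm[j][k] for j in range(n_j)) for k in range(n_k)]
    # (c) Minimum set selection: keep the indices with the smallest minima.
    keep_j = sorted(range(n_j), key=lambda j: row_min[j])[: int(n_j * keep_fraction)]
    keep_k = sorted(range(n_k), key=lambda k: col_min[k])[: int(n_k * keep_fraction)]
    return sorted(keep_j), sorted(keep_k)
```

Passing `keep_fraction=0.25` mimics the 75% rejection variant mentioned above, since only the selection size changes.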

Consensus
As described in ref. 3, the consensus protocol is another crucial insight contributing to the success of the algorithm. The consensus protocol enables the local decisions at each node to be propagated to their linked nodes in adjacent layers so that the algorithm ultimately converges to the globally smoothest path. The consensus protocol is invoked after each iteration of pruning. Information percolates over time from each layer to more and more distant layers and a global vision over all layers is eventually achieved. In concert, all heavy-duty computations in the pruning step are local and distributed across all the nodes to be executed in parallel with low latency and high throughput.
In the consensus protocol, each link $(a, l_1; b, l_2)$ is compared with its partner link $(b, l_2; a, l_1)$ as maintained by the two respective nodes $(a, l_1)$ and $(b, l_2)$. If either link has been pruned by its respective node, the partner link is also eliminated from its linked node. The consensus protocol ensures that all surviving links are bi-directional, i.e. both nodes agree on their mutual link. Hence, after each iteration of pruning and consensus, the number of surviving links at each node is somewhat smaller than $2^t \times 2^t$.

Quality control
At the end of this iterative algorithm, any surviving global path of length $L-1$ provides a linked list of nodes that serves as a reconstructed track (ref. 3). Multiple tracks may be found in a wedge, most of which are ghosts. There is no assurance yet of the track quality; the goal of pruning and consensus is to find the smoothest possible tracks without any a-priori threshold on the smoothness. A subsequent quality control procedure has been described in ref. 3. For both the first and second (signed) derivatives, crookedness is defined as the largest difference between any pair of nodes along a track. For example, the second derivative of a zig-zag track changes sign and will likely have a large value of crookedness. These metrics are computed separately for each dimension of 2D tracks. If the same track has the smallest value of crookedness for each of the four metrics, it is labelled as the smoothest track and selected as the final output of the algorithm.
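The crookedness definition can be made concrete with a small Python sketch. It assumes unit spacing between layers, so that derivatives reduce to plain finite differences; the weighted coordinates used in the actual circuit are not reproduced, and all function names are hypothetical.

```python
def differences(values):
    """Finite differences between consecutive entries."""
    return [b - a for a, b in zip(values, values[1:])]


def crookedness(derivatives):
    """Largest difference between any pair of signed derivative values."""
    return max(derivatives) - min(derivatives)


def track_metrics(phi, z):
    """Four crookedness metrics for one track, given per-layer coordinates."""
    metrics = {}
    for name, coords in (("phi", phi), ("z", z)):
        first = differences(coords)    # first derivatives (unit spacing assumed)
        second = differences(first)    # second derivatives
        metrics[name + "'"] = crookedness(first)
        metrics[name + "''"] = crookedness(second)
    return metrics
```

A straight track gives zero for all four metrics, whereas a zig-zag such as `phi = [0, 10, 0, 10, 0]` has second differences that alternate in sign and therefore a large second-derivative crookedness.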
Useful byproducts of the quality control procedure are the selected track's curvature (inverse transverse momentum) and polar direction, as well as the four metrics of track quality. Trigger decision criteria can subsequently be applied to these quantities. It is straightforward to add a simple circuit to compute the track's azimuthal angle.
A possibility considered and resolved in ref. 3 is the intersection of two tracks. The solution involves an intervention after the second-last iteration to check for two smooth trajectories passing through a node. As the pruning executes at each node simultaneously, the required actions can be inserted into each node engine. Since the intersection of two smooth trajectories is a rare occurrence and can be resolved with a small addendum, the circuitry required for this intervention will be discussed in a future paper.

Implementation
In this section we discuss the implementation details of the hardware modules. As shown in Fig. 1, the data flow through the following modules in sequence: laplacian calculator (LC), minimum finders (MF), minimum set selectors (MSS) and consensus protocol (CP). The latter three are chained $n$ times for $t = n, \ldots, 1$. The final module is quality control (QC).
We implement the circuit using the Xilinx Vitis HLS tool. Vitis HLS generates an RTL (register-transfer level) design of the digital network in Verilog and VHDL formats from its high-level C/C++ representation. These RTL formats can be used for programming an FPGA. Our results are presented using the Xilinx FPGA XCVU19p-2-e, which has 4.1M lookup tables (LUT), 8.2M flip-flops (FF) and 3840 digital signal processors (DSP). All circuits are synchronous with an internal clock of 0.85 ns cycle time. Although this is a little faster than the recommended 1.1 ns clock cycle for this FPGA, it demonstrates the feasibility of a real-time track trigger.
In section "Resource usage" we show the hardware resource usage on the FPGA in terms of LUTs, FFs and DSPs, as well as module latencies according to the Vitis synthesis.

Laplacian calculator
The computation of the $N^3(L-2)$ values of $\Box_{ijk,l}$ from $N \times L$ coordinates is shown in ref. 3 using weighted sums. The weighted coordinates incorporate the radial distances between layers, alignment corrections, and differences in resolution between the azimuthal and longitudinal dimensions. The weights also depend on whether the first or second derivative is being computed. For each hit there are three weighted coordinates for the three possible second derivatives (Eq. 7 of ref. 6), and two weighted coordinates for the forward and backward difference respectively (Eq. 6 of ref. 6). These five weighted coordinates for each hit position (per dimension) can be compacted into a long integer and stored in a lookup table.
Using the weighted coordinates as inputs, the LC uses only addition and the absolute value operation to compute the $N^3(L-2)$ tuples and save them in an array, TM. The loops over $l$ and $i$ are unrolled so that the computations at each node proceed simultaneously in independent, replicated modules. In each module, the 3-term sum corresponding to the laplacian is split into two sequential pairwise sums. The latter are embedded inside a pipelined loop over $j$ and an unrolled loop over $k$. A pairwise sum is performed by a DSP in one clock cycle.
The LC is designed for two-dimensional silicon sensors that measure both azimuthal ($\phi$) and longitudinal ($z$) coordinates. We represent these coordinates as 16-bit integers, which are passed to the LC as a bit-packed 32-bit word. In the LC, both coordinates are unpacked and their second derivatives are computed in a set of parallelized and pipelined DSPs. The final steps compute and add the respective absolute values, again using DSPs, to obtain the 2D laplacian $\Box_{ijk,l} = |\phi''_{ijk,l}| + |z''_{ijk,l}|$ (for $l \in \{1, \ldots, L-2\}$; ref. 3), and pack the 24-bit tuple. The difference in the sensors' measurement resolution between the azimuthal and longitudinal coordinates has already been taken into account in their respective weighted values supplied to the LC. We expect 16-bit coordinates to provide adequate resolution of O(1 µm) since wedge dimensions are expected to be smaller than 6 cm. This design results in a 4-stage pipeline with $N = 2^n$ iterations over the pipeline, resulting in efficient (high duty factor) usage of LUTs, FFs and DSPs. With $N = 16$ we achieve a latency of 21 ns (24 clock cycles) for the LC.
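The arithmetic of the LC can be sketched in Python under simplifying assumptions. Unit weights replace the weighted coordinates of ref. 3, so the middle term is simply $-2$ times the coordinate of hit $i$; placing $\phi$ in the upper half of the 32-bit word and saturating the result at 16 bits are also assumptions, and the function names are hypothetical.

```python
def pack_hit(phi, z):
    """Pack two unsigned 16-bit coordinates into one 32-bit word."""
    return ((phi & 0xFFFF) << 16) | (z & 0xFFFF)


def unpack_hit(word):
    """Return (phi, z) from a packed 32-bit word."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def laplacian_2d(hit_i, hit_j, hit_k):
    """|phi''| + |z''| for the triplet (i, j, k), using unit weights (assumption)."""
    phi_i, z_i = unpack_hit(hit_i)
    phi_j, z_j = unpack_hit(hit_j)
    phi_k, z_k = unpack_hit(hit_k)
    # Three-term sum split into two pairwise sums: neighbours first,
    # then the (here unit-weighted) middle term.
    d2phi = (phi_j + phi_k) + (-2 * phi_i)
    d2z = (z_j + z_k) + (-2 * z_i)
    # Saturate so the value fits a 16-bit laplacian field (assumption).
    return min(abs(d2phi) + abs(d2z), 0xFFFF)


def laplacian_matrix(hit_i, outer_hits, inner_hits):
    """Matrix of laplacians for one node, indexed [j][k]."""
    return [[laplacian_2d(hit_i, hj, hk) for hk in inner_hits] for hj in outer_hits]
```

In this form the only operations are additions and absolute values, which mirrors the statement that the LC needs nothing beyond those two operations once the weights are folded into the inputs.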

Minimum finder
The MF architecture is a pipeline of $t$ stages, with the successive stages consisting of $2^{t-1}, 2^{t-2}, \ldots, 1$ compare-and-minimize (CAM) units running in parallel. Each CAM outputs the smaller of its two input laplacians. The MF finds the minimum of $2^t$ inputs with a latency of $2t$ clock cycles. For each node $(i, l)$, the MF finds the row-wise minima of the 2D array of tuples. The block diagram of a pipelined MF is shown in Fig. 2. Each MF uses $2^t - 1$ CAM units. The latency of the MF is less than 29 clock cycles and decreases with both $2^t$ (due to pipelining) and $t$ (since the number of sequential internal stages is $s = t$).
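The tree structure of the MF can be emulated in a few lines of Python, which also reproduces the quoted counts of $t$ stages and $2^t - 1$ CAM units. The function name is hypothetical and the sketch models only the logic, not the clocked pipeline.

```python
def minimum_finder(values):
    """Binary reduction tree of compare-and-minimize (CAM) units.

    Returns (minimum, number_of_stages, number_of_cam_units).
    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    level = list(values)
    stages = 0
    cams = 0
    while len(level) > 1:
        # One stage: each CAM keeps the smaller of a pair of inputs.
        level = [min(level[m], level[m + 1]) for m in range(0, len(level), 2)]
        cams += len(level)
        stages += 1
    return level[0], stages, cams
```

For eight inputs ($t = 3$) the sketch uses three stages with 4, 2 and 1 CAM units, i.e. seven units in total.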

Minimum set selector
The MSS is based on Batcher's bitonic sorter (refs. 12, 13) that uses compare-and-exchange (CAE) units. Each CAE sorts its two inputs into ascending order. We implement an MSS that sorts $2^t$ inputs minimally so that the first $2^{t-1}$ values are the smallest.
Figure 3 shows a block diagram of a pipelined MSS. We take advantage of the pipelined design to process both the row-minima and the column-minima sequentially using a single MSS per node. It is possible to increase the duty factor by using the same MSS for multiple nodes, further increasing efficiency and reducing resource usage for a given latency requirement. The latency of the MSS is less than 29 clock cycles and reduces with $t$ as $\approx t^2$, since the number of sequential internal stages is $s = \tfrac{1}{2}(t-1)t + 1$. The MSS uses $2^{t-1}s$ CAE units.
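One construction that is consistent with the quoted stage count is sketched below in Python: each half of the input is sorted with a bitonic network, and a single additional compare-and-exchange stage then moves the smallest values into the lower half. Whether the circuit in Fig. 3 is wired exactly this way is an assumption, and the function names are hypothetical.

```python
def bitonic_sort(values):
    """Bitonic sorting network (ascending); returns (sorted_list, stage_count)."""
    a = list(values)
    n = len(a)
    stages = 0
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Compare-and-exchange in the direction set by the block.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            stages += 1
            j //= 2
        k *= 2
    return a, stages


def minimum_set_select(values):
    """Split 2^t inputs so the lower half holds the 2^(t-1) smallest values."""
    half = len(values) // 2
    # The two halves would be sorted in parallel in hardware, so stages count once.
    lower, stages = bitonic_sort(values[:half])
    upper, _ = bitonic_sort(values[half:])
    # Final stage: pair the i-th smallest of one half with the i-th largest of the other.
    for i in range(half):
        p = half - 1 - i
        if lower[i] > upper[p]:
            lower[i], upper[p] = upper[p], lower[i]
    return lower, upper, stages + 1
```

For 16 inputs ($t = 4$) the sketch needs $\tfrac{1}{2}(3)(4) + 1 = 7$ stages, and neither output half is fully sorted, which reflects the point that no effort is spent on ordering beyond what the selection requires.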

Consensus protocol
The implementation of the CP is based on an array of booleans GL[l − 1][i][2][j] storing valid links between a node $(i, l)$ and another node $(j, l \pm 1)$, where the sign is stored in the third (binary) dimension. GL contains a redundancy since for each pair of nodes in adjacent layers, the status of both unidirectional partner links, one directed radially outward and the other directed radially inward, are stored. This redundancy is an important aspect of the design since it enables a completely deterministic (data-independent) architecture and latency. Consensus is imposed by setting both partner links to false if either of the partner links is false. This crucial step propagates locally-generated information in both directions along the tracks, enabling a globally-optimal decision.
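The consensus step on a GL-like array can be sketched as follows. The index convention (direction 0 for the inward link, 1 for the outward link) and the restriction to partner links held by two intermediate layers are assumptions made for this illustration; the function name is hypothetical.

```python
INWARD, OUTWARD = 0, 1  # assumed meaning of the binary third dimension


def apply_consensus(gl):
    """Set both partner links to False if either one is False.

    gl[m][i][d][j] is the link from node i of intermediate layer l = m + 1
    to node j of layer l - 1 (d = INWARD) or l + 1 (d = OUTWARD).
    """
    n_mid = len(gl)
    for m in range(n_mid - 1):
        for i in range(len(gl[m])):
            for j in range(len(gl[m][i][OUTWARD])):
                # Partner of (i, l; j, l+1) is (j, l+1; i, l), held by the outer node.
                agreed = gl[m][i][OUTWARD][j] and gl[m + 1][j][INWARD][i]
                gl[m][i][OUTWARD][j] = agreed
                gl[m + 1][j][INWARD][i] = agreed
    return gl
```

Every loop bound is fixed by the array dimensions, so the work done is the same whatever the link pattern, in keeping with the data-independent latency described above.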

Build pruned matrix
As described above, all laplacians are computed once at the beginning of the wedge data flow into the circuit and stored in TM as one $2^n \times 2^n$ matrix per node. Starting with the second iteration of the algorithm, $t < n$, the MF process $2^t \times 2^t$ matrices of surviving paths and the MSS process $2^t$-length arrays. Thus we need to build pruned versions of TM for each node, TM → TMP, with the lengths of the $j$ and $k$ dimensions each reduced by a factor of 2 (given our rejection factor of 50%). The $2^t \times 2^t$ TMP matrices per node serve as the inputs to MF for $t < n$.
Pruning eliminates 3/4 of the local paths at each node. TM is initially a completely dense matrix, and pruning and consensus increase its sparsity; with each iteration of pruning its density decreases by a factor of 4.

Table 1. Timing performance and resource usage of various modules and sub-modules as estimated by synthesis using version 2020.2 of Vitis HLS. In the "block" column, "all" refers to the collection over all 3 × 16 nodes in the graph, corresponding to 3 intermediate sensor layers and 16 hits per layer. This replication of the LC, MF and MSS modules is also indicated in the "module" column. In the "pipelined function" column, "PSp" refers to the $p$th pipeline stage of the minimum finders, and the replication of the pipeline stages in the MSS is indicated. The pipelined functions used in LC are described in section "Resource usage". Initiation interval refers to the wait time until the circuit can process new data. Time delay in terms of the number of clock cycles is denoted by "cc", where 1 cc = 0.85 ns. The first row shows the total resources used by the entire system.

Figure 5. Examples of the track-finding ability of the algorithm, demonstrated on simulated data. The C code used for Vitis synthesis is executed as software to emulate the algorithm's hardware results. The red points represent the hits associated with the high-momentum particle of interest, and the blue points represent hits from random noise. The red curve shows the trajectory identified by the algorithm. The embedded particle has a transverse momentum of 10 GeV/c and traverses an axial magnetic field of 2 T.

Figure 6. Results of a high-statistics C simulation test. Distributions of the smoothness metrics $\Delta\phi''$ and $\Delta z''$ and the consistency metrics $\Delta\phi'$ and $\Delta z'$ in the two dimensions respectively are shown for ten million simulated particles ($p_T > 10$ GeV). The distributions are discrete because all hit coordinates and their derivatives are represented as integers. The rate of unreconstructed or misreconstructed tracks, which are indicated by a value set to $10^4$ for these metrics (shown in red), is 0.05%.
The purpose of the build-pruned-matrix (BPM) module is to compactify the sparsified version of TM to produce TMP, which is smaller and almost completely dense. BPM uses the information on surviving links stored in the GL matrix to perform the compactification. BPM executes before MF in order to supply TMP to MF.
One of the goals of the algorithm proposed in ref. 3 is to make the FPGA circuit architecture, as well as its throughput and latency, completely independent of features of the data. All characteristics of the circuit should be a-priori deterministic and calculable. To this end, BPM defines TMP with fixed-length dimensions based on the deterministic nature of pruning.
The data dependency is handled by the consensus protocol. Another function of BPM is to propagate this information garnered by CP. One of the sub-modules of BPM sets the laplacians to ∞ in TMP for those local paths that are eliminated by CP. In this way, the data structures and logic circuits remain data-independent; the local paths flagged by CP for elimination are removed by the next iteration of pruning.
This factorization of functions is one of the insights presented in this paper as a way to handle all data with pre-determined circuits. One of the enabling features of this implementation is redundancy of critical information. In the case of BPM, the information in GL is partially replicated by storing the node indices of surviving links in redundant arrays. In practice, the additional memory usage is minimal and the benefit is substantial. The latency of BPM is less than 10 clock cycles and reduces with $t$.
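The compactification performed by BPM can be sketched in Python for one node. The arguments are hypothetical: `kept_j` and `kept_k` stand for the fixed-length index lists left by pruning, `outward_ok` and `inward_ok` stand for the node's GL entries after consensus, and the all-ones 16-bit value is an assumed stand-in for ∞.

```python
INF = 0xFFFF  # assumed representation of "infinity" in a 16-bit laplacian field


def build_pruned_matrix(tm, kept_j, kept_k, outward_ok, inward_ok):
    """Build the dense, fixed-size TMP from TM for one node.

    Paths whose outward or inward link was eliminated by consensus keep
    their slot but receive INF, so the matrix shape never depends on the data.
    """
    return [
        [tm[j][k] if outward_ok[j] and inward_ok[k] else INF for k in kept_k]
        for j in kept_j
    ]
```

Because an eliminated path is marked rather than removed, the output always has `len(kept_j)` rows and `len(kept_k)` columns, and the next pruning iteration discards the marked entries through the ordinary minimum search.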

Quality control
The QC module consists of three sub-modules, findAllTracks (fAT), findBestTrack (fBT) and removeGhostTracks (rGT). We choose one of the $L$ layers as the anchor layer at which tracks are defined; in practice, the layer that is radially in the middle is the most convenient. Iterating over all nodes in this layer, fAT creates a linked-list of nodes connected to each of these anchor nodes, thereby making a collection of tracks.
Next, fBT computes the four crookedness values along each of these tracks, as mentioned in section "Quality control", using the node coordinates as inputs to DSPs to calculate first and second derivatives. Batcher's bitonic sorters are used to find the smallest and the largest values of each metric; four sorters are deployed in parallel to ensure low latency. DSPs are used to calculate the crookedness values from these extrema.
Here again we encounter potential data-dependence in the number of track candidates. To eliminate data dependence, the fBT circuitry is replicated for each anchor node, regardless of whether a candidate track passes through that node. Typically, candidate tracks pass through half of the anchor nodes, implying that the other half of the fBT resources are wasted. The resource usage shown in Table 1 indicates that this cost is a small fraction of the total resources available. Hence we use this simple but effective solution to ensure a deterministic latency of the fBT sub-module. The array of booleans GL (see section "Consensus protocol"), which keeps a record of valid links between nodes, is used to flag and reject invalid track candidates subsequently.

Figure 8. Results of a high-statistics C simulation test on ten million random hit collections, similar to Fig. 6 but without embedding a high-momentum particle of interest. Distributions of the smoothness metrics $\Delta\phi''$ and $\Delta z''$ and the consistency metrics $\Delta\phi'$ and $\Delta z'$ in the two dimensions respectively are shown. The spurious trigger rate is estimated to be (0.3 ± 0.2) per million collections, where a trigger track is defined as a reconstructed track with all four quality metrics below the value of 10.
For each of the four crookedness metrics, fBT deploys a MF to find the track with the smallest crookedness value. If the same track is selected by all four criteria, fBT returns this track and its parameters as the output of the circuit.
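The selection rule of fBT can be written as a short Python sketch. The names are hypothetical, and filtering invalid candidates out of a list is a software convenience; in the circuit every anchor node has its own fBT instance and invalid candidates are only flagged.

```python
def find_best_track(metrics, valid):
    """Return the index of the track that wins all four metrics, else None.

    metrics[t] holds the four crookedness values of candidate t;
    valid[t] is False for anchor nodes with no valid candidate track.
    """
    candidates = [t for t in range(len(metrics)) if valid[t]]
    if not candidates:
        return None
    # One minimum search per metric (the role played by the four MFs).
    winners = {min(candidates, key=lambda t: metrics[t][m]) for m in range(4)}
    # A track is returned only if the four searches agree.
    return winners.pop() if len(winners) == 1 else None
```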
The final sub-module rGT removes the remaining (ghost) tracks from the array GL by purging their associated links.

Track parameters and metrics
As shown in ref. 3, the inverse of the particle's momentum transverse to the beam axis (i.e. curvature) is related to the first derivative in the azimuthal coordinates, and the particle's polar direction is related to the first derivative in the longitudinal coordinates. Since these derivatives have already been computed and sorted in the QC module, we use the average of the two median values (i.e. ignoring the extremum values) of these first derivatives to represent the best track's curvature and polar direction. These quantities are provided for subsequent trigger decisions.
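The estimator described above amounts to averaging the two central values of the sorted first derivatives. A minimal Python sketch follows, assuming an even number of first derivatives (four for a five-layer wedge); the function name is hypothetical and the conversion to physical curvature or polar direction is not shown.

```python
def central_first_derivative(first_derivatives):
    """Average of the two median values of an even-length list of derivatives."""
    n = len(first_derivatives)
    if n < 2 or n % 2:
        raise ValueError("expected an even number of first derivatives")
    ordered = sorted(first_derivatives)
    mid = n // 2
    # The extremum values are ignored; only the two central entries are used.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

Discarding the extrema limits the influence of a single badly measured or wrongly assigned hit on the reported parameter.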
Similarly, the four crookedness metrics of the best track are also provided by the QC module. Together they serve as a proxy for the $\chi^2$ of a helical fit to the hit coordinates. These metrics can be used for subsequent rejection of ghost tracks. On the basis of these metrics, studies of the ghost rate have been shown in ref. 3 to be low enough to meet trigger-bandwidth requirements.

Event pipeline
The LHC produces new data every 25 ns. To accomplish a real-time processing architecture, we configure the modules into blocks such that each block's latency is under 25 ns. The pipeline breaks our iterative algorithm into a sequence of smaller tasks to achieve data flow at a rate determined by the slowest task in the workflow. As shown in Fig. 4, the data flow is designed to be unidirectional with no loops or branches and hence amenable to pipelining.
We combine BPM and MF into one block, and MSS and CP into another block, so that together with LC and QC there are four types of blocks constituting the event pipeline. This grouping minimizes the number of pipeline stages, the idle time of the hardware and the total latency of the pipeline, while maintaining the 40 MHz real-time throughput.
When a collision event occurs, data from a wedge of sensors are fed into the LC block. Its output TM is available for the first MF ($t = 4$) before the next event arrives. We implement a "shift register" of TM such that each event's TM is accessible by all blocks processing that event sequentially (corresponding to $t = 4, 3, 2, 1$). In synchronization, the event's processed information evolves down the pipeline until the best track is generated ≈ 250 ns after the raw data were fed into the system. Since there are no loops and branches in this workflow, the event pipeline can process a continuous stream of events indefinitely.
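The timing budget can be checked with a few lines of Python. The sketch assumes that every block occupies exactly one 25 ns slot and that the pipeline consists of one LC block, a (BPM+MF) and an (MSS+CP) block per iteration, and one QC block; this block count is an inference from the description above, and the function names are hypothetical.

```python
import math

CLOCK_NS = 0.85          # clock period quoted in the text
BUNCH_SPACING_NS = 25.0  # LHC bunch-crossing interval


def max_cycles_per_block():
    """Largest whole number of clock cycles that fits in one bunch spacing."""
    return math.floor(BUNCH_SPACING_NS / CLOCK_NS)


def pipeline_blocks(n_iterations):
    """LC + two blocks per pruning iteration + QC (assumed grouping)."""
    return 1 + 2 * n_iterations + 1


def pipeline_latency_ns(n_iterations):
    """Total latency if each block takes one bunch-spacing slot (assumption)."""
    return pipeline_blocks(n_iterations) * BUNCH_SPACING_NS
```

With four iterations this gives ten blocks and 250 ns, in line with the quoted latency, and the 29-cycle limit per slot matches the latency bounds quoted for the MF and MSS modules.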

Validation
Detailed studies of the physics case for this algorithm and its analytic performance metrics have been presented in ref. 3. It was shown that, for a 40 MHz beam collision rate with 200 proton-proton interactions per beam collision, the algorithm can achieve a signal efficiency > 99.9% and a spurious trigger rate of O(10) kHz.
The thrust of this paper is the algorithm's implementation as a parallelized graph-computing architecture that has a pre-determined latency, throughput and resource usage for a pattern recognition use case that is typically considered to be non-deterministic. Since the algorithm has been re-implemented to deliver on these requirements, we demonstrate the logical consistency of this implementation by executing on simulated data the C code used for Vitis synthesis. The data are simulated by embedding the hits associated with a high-momentum charged particle ($p_T > 10$ GeV) within a collection of randomly distributed hits. We implement multiple Coulomb scattering, which deflects the particle direction by an amount dependent on the momentum and the radiation lengths traversed. The latter is 4% for each sensor layer at normal incidence, as in ref. 3. Assuming 2D pixels of dimensions 50 µm × 50 µm, hits are smeared uniformly over a ± 25 µm interval in each dimension to emulate digitization. Figure 5 shows examples of the software emulation, illustrating that the circuit logic correctly finds the trajectory of the particle of interest.
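For reference, the rms hit resolution implied by this uniform smearing follows from the standard result for a uniform distribution of full width $w$; the symbol $\sigma_{\mathrm{hit}}$ is introduced here only for this illustration.

```latex
\sigma_{\mathrm{hit}} = \frac{w}{\sqrt{12}}
                      = \frac{50\ \mu\mathrm{m}}{\sqrt{12}}
                      \approx 14.4\ \mu\mathrm{m}
```

Half of this value is about 7.2 µm, close to the 7 µm rms quoted for the improved-resolution study later in this section.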
As mentioned in the sections describing the quality control (QC) module, our circuit returns four quality metrics as well as two physics parameters associated with the best track. The metrics referred to as $\Delta\phi''$ ($\Delta\phi'$) and $\Delta z''$ ($\Delta z'$) in ref. 3 quantify the largest difference in the second (first) derivatives along the track. The results of a high-statistics C simulation (Fig. 6) show that the inefficiency of the algorithm on simulated data is 0.05%, and demonstrate the effectiveness of the salient feature of our algorithm: local decisions coupled with information percolation lead to the globally optimal decision.
The fidelity of the algorithm is demonstrated by comparing the curvature and the cotangent of the polar angle of the reconstructed track with the corresponding values for the simulated particle. The comparison (Fig. 7) demonstrates that tracks are reconstructed with the expected resolution and that the rate of non-Gaussian errors is negligible.
An important aspect of trigger design is the rate of spurious triggers, i.e. reconstructed tracks satisfying the trigger requirements in the absence of a true particle of interest. To estimate the spurious trigger rate for this implementation, we execute the C code on ten million collections of random hits as for Fig. 6, but without embedding a high-$p_T$ particle. The distributions of the quality metrics for (spurious) reconstructed tracks, shown in Fig. 8, are skewed toward large values. We define a trigger track as a reconstructed track whose quality metrics all have values less than 10. This selection requirement is motivated by Fig. 6, where the distributions for correctly-reconstructed particles peak well below the value of 10 ($\log_{10}[\mathrm{metric}] < 1$), but have a second peak well above this value when the algorithm misses one or more correct hits. With this quality requirement, the algorithm's efficiency is still 99.94% (the inefficiency for true particles increases from 0.05% to 0.06%), and the spurious trigger rate is $(0.3 \pm 0.2_{\mathrm{stat}})$ per million wedges. With the ≈ 2000 wedges needed for coverage of the pixel detector, the expected spurious trigger rate is O(0.1%) per bunch crossing or O(40 kHz).
Note that the hit resolution assumed above is for single-pixel hits; charge-sharing between adjacent pixels improves the cluster's position resolution considerably. The performance of our algorithm improves with hit resolution; to illustrate, the study is repeated with a hit resolution improved by a factor of two (rms of 7 µm, as assumed in ref. 3). For the same quality requirement on the trigger track as above, the inefficiency reduces by a factor of three, to 0.02%, and the spurious trigger rate reduces by more than a factor of three, to < 0.1 per million wedges or O(10 kHz), consistent with the detailed study presented in ref. 3.
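The conversion from a per-wedge spurious rate to a trigger rate is simple arithmetic, sketched here in Python with the numbers quoted in the text (≈ 2000 wedges, 40 MHz bunch-crossing rate). The function name is hypothetical, and the results are central values only; the quoted statistical uncertainty is not propagated.

```python
def spurious_rate(per_million_wedges, n_wedges=2000, crossing_rate_hz=40e6):
    """Return (spurious triggers per bunch crossing, spurious trigger rate in Hz)."""
    per_crossing = per_million_wedges * 1e-6 * n_wedges
    return per_crossing, per_crossing * crossing_rate_hz
```

A rate of 0.3 per million wedges gives about $6 \times 10^{-4}$ per bunch crossing, or roughly 24 kHz, and 0.1 per million wedges gives roughly 8 kHz; both agree in order of magnitude with the rates stated above.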

Discussion
As discussed in ref. 3, the 2D pixel sensors of the ATLAS and CMS experiments at the LHC would record $O(10^5)$ hits every 25 ns. It would require a bandwidth of tens of Tbps to read out this information. An alternate approach is to install the track-finding circuitry on-detector, requiring data transmission over local detector regions only. Off-detector readout would be triggered if a high-momentum track is identified. Our design enables this edge-computing capability; the point cloud would be partitioned into O(1000) wedges, each processed by our proposed circuit, all on-detector. Our long-term vision is the implementation of this "smart tracker" with self-triggering capability.
This edge-computing approach will require the slicing algorithm mentioned in the introduction to be implemented as a high-throughput and low-latency circuit which will operate upstream of the track-finder presented here. We note that the LUT and FF usage of the track-finder (shown in Table 1) is 80% and 50% respectively of the resources available on the chosen FPGA. We will investigate the possibility of implementing the slicing algorithm using the remaining resources, to minimize the system's footprint, power and cooling needs.
The circuit design could be ported from an FPGA to an application-specific integrated circuit (ASIC) to reduce the footprint substantially; however, as FPGAs with higher circuit density become available, a transition to ASICs may be unnecessary. The XCVU19P is fabricated with the integrated-circuit technology node of 16 nm, and 7 nm is expected for the next generation of FPGAs. Radiation tolerance can be achieved by using embedded FPGA (eFPGA) technology to integrate the intellectual property (IP) core of the FPGA into an ASIC.

Table 2. Resource usage according to Vitis HLS 2020.2 synthesis for three values of $(L-2)$, the number of intermediate sensor layers. The quality-control module is excluded from these syntheses because its resource usage scales differently with $(L-2)$. The usage for the rest of the circuit is proportional to $(L-2)$, as expected since the other modules are repeated for each intermediate layer.

Figure 2. Block diagram of a pipelined MF8 corresponding to an MF built for t = 3.

Figure 7. Results of a high-statistics C simulation test. Distributions of the differences $\sigma_c \equiv (c_{\mathrm{reconstructed}} - c_{\mathrm{truth}})$ and $\sigma_{\cot\theta} \equiv (\cot\theta_{\mathrm{reconstructed}} - \cot\theta_{\mathrm{truth}})$ are shown for ten million simulated particles, where $c$ refers to the curvature of the trajectory in the azimuthal dimension and $\cot\theta$ refers to the cotangent of the polar angle in the longitudinal dimension. The curvature is defined as $c \equiv q/p_T$, where $q$ is the particle charge and $p_T$ is its momentum component transverse to the beam collision axis. The curvature and $\cot\theta$ distributions are generated uniformly over the intervals [−0.1, 0.1] GeV$^{-1}$ and [−0.8, 0.8] respectively. The curvature resolution is 7.9 TeV$^{-1}$ and the $\cot\theta$ resolution is 0.25 ‰.
Figure 9. (left) LUT usage of the synthesized MF module as a function of $2^t$, the number of inputs. (right) LUT usage of the synthesized MSS module as a function of $2^t t^2$, where $2^t$ is the number of inputs to be sorted. The open circles show the estimates from Vitis HLS for $t \in \{1, 2, 3, 4\}$ respectively. The line represents the best linear fit to the point estimates.

Scientific Reports | (2024) 14:10181 | https://doi.org/10.1038/s41598-024-60319-9 | www.nature.com/scientificreports/